feat: Improve `flights.*` dataset reproducibility #645
Conversation
- (Locally) Using an adapted version of https://github.com/vega/altair/blob/main/pyproject.toml - Can add to the repo later if desired (#643 (comment))
- Uses [niquests](https://niquests.readthedocs.io/en/latest/index.html) for async - Also, see notes in `_write_rezip`
Very WIP: adapted some content from `flights.py`
Logging was hiding the actual code
- `_write_zip_to_parquet` has a detailed doc breaking down the benefits - Overall, much faster and requires less storage
Doc is now very explicit about what *cleaning* means here. Resolves #645 (comment)
- short summary for `Flights.run` - examples for each constructor in class-level doc #645 (comment)
- moves `Flights.scan_sources` -> `SourceMap.from_specs` - rename `SourceMap.add_dependency` -> `SourceMap.add_spec` #645 (comment)
- Makes the doc visible in more places - Added a description of what `None` means per-extension - Fixed typo `"ISO 6801"` -> `"ISO 8601"`
Should I wait for the last todos to be done?
Ah @domoritz I wasn't expecting you so soon! Yeah I'm done now, was just filling out the description to make reviewing easier.
You should be able to simply run the script locally, without arguments.

Great. @dsmedia can you do a detailed review? I will do a rough check. We don't need to run the scripts on the CI.

Thanks @domoritz
Haven't needed this alias since aede7f6
This looks great @dangotbanned. The pattern of downloading data from this API, transforming it, and generating clean sample datasets of various sizes is something that I could see being useful for analysis of this flights dataset well beyond the purposes of this repo. Given that broader potential, here is one suggestion that might add some flexibility to the valid row counts while keeping things clean. Instead of restricting row counts to the specific numbers in

```python
type Rows = Literal[
    1_000,
    2_000,
    5_000,
    10_000,
    20_000,
    100_000,
    200_000,
    500_000,
    1_000_000,
    3_000_000,
    5_000_000,
    10_000_000,
    100_000_000,
    500_000_000,
    1_000_000_000,
]
"""Number of rows to include in the output."""
```

we could allow any multiple of 1,000 for thousands and any multiple of 1,000,000 for millions. For example:

```python
def validate_row_count(n: int) -> bool:
    if n >= 1_000_000:
        return n % 1_000_000 == 0  # Must be exact millions
    return n % 1_000 == 0  # Must be exact thousands
```

This would let users choose numbers like 66,000 (`flights-66k`), 123,000 (`flights-123k`), or 676,000,000 (`flights-676m`), while keeping filenames clean and predictable.

If this script gets adopted by other projects (which I could definitely see happening), this flexibility would be particularly valuable. Different projects might need different sample sizes - some might want 66k rows for a specific benchmark, others might need 676m rows to test large-scale processing. Having this flexibility while maintaining clean, predictable filenames would make the script more broadly useful.
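As a quick sanity check of the proposed validator (illustrative usage of the snippet above, not code from the PR):

```python
assert validate_row_count(66_000)         # exact thousands -> ok
assert validate_row_count(676_000_000)    # exact millions -> ok
assert not validate_row_count(999)        # not a multiple of 1,000
assert not validate_row_count(1_500_000)  # >= 1m must be whole millions
```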
Thanks for taking a look @dsmedia!

I didn't do a good job of documenting this here.
```python
@property
def name(self) -> str:
    """
    Encodes a short form of ``n_rows`` into the file name.

    Examples
    --------
    Note that the final name depends on ``suffix``:

    | n_rows         | stem          |
    | -------------- | ------------- |
    | 10_000         | "flights-10k" |
    | 1_000_000      | "flights-1m"  |
    | 12_000_000_000 | "flights-12b" |
    """
    frac = self.n_rows // 1_000
    if frac >= 1_000_000:
        s = f"{frac // 1_000_000}b"
    elif frac >= 1_000:
        s = f"{frac // 1_000}m"
    elif frac >= 1:
        s = f"{frac}k"
    else:
        raise TypeError(self.n_rows)
    return f"{self._name_prefix}{s}{self.suffix}"
```
So the 3 examples you gave all work currently (at runtime), but `999` fails:
Example code block

```python
from scripts.flights2 import Spec, DateRange

date_range = DateRange((2001, 1), (2001, 12))
spec_0 = Spec(date_range, 66_000, ".parquet")
spec_1 = Spec(date_range, 123_000, ".parquet")
spec_2 = Spec(date_range, 676_000_000, ".parquet")
spec_bad = Spec(date_range, 999, ".parquet")

print(spec_0.name)
print(spec_1.name)
print(spec_2.name)
print(spec_bad.name)
```
Traceback

```
flights-66k.parquet
flights-123k.parquet
flights-676m.parquet
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 12
     10 print(spec_1.name)
     11 print(spec_2.name)
---> 12 print(spec_bad.name)

File ../vega-datasets/scripts/flights2.py:415, in Spec.name(self)
    413     s = f"{frac}k"
    414 else:
--> 415     raise TypeError(self.n_rows)
    416 return f"{self._name_prefix}{s}{self.suffix}"

TypeError: 999
```
Admittedly though, all 4 of these equally trigger a type checker warning (statically), which is not very intuitive:
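For example (the diagnostic wording here is approximate and varies by type checker; this is an illustration, not a captured log):

```python
spec_0 = Spec(date_range, 66_000, ".parquet")
# pyright: Argument of type "int" is not assignable to parameter of type "Rows"
spec_bad = Spec(date_range, 999, ".parquet")
# pyright: Argument of type "int" is not assignable to parameter of type "Rows"
```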
The resolved environment for this script has a transitive dependency with some examples of runtime `Annotated` validation. I explored this a little with the `DateTimeFormat` type.
There, I've used a very loose approximation of what would be accepted by chrono:

`DateTimeFormat` (vega-datasets/scripts/flights2.py, Lines 131 to 184 in 8481618)
```python
def is_chrono_str(s: Any) -> TypeIs[_ChronoFormat]:
    return s == "%Y/%m/%d %H:%M" or (isinstance(s, str) and s.startswith("%"))


def is_datetime_format(s: Any) -> TypeIs[DateTimeFormat]:
    return s in {"iso", "iso:strict", "decimal"} or is_chrono_str(s) or s is None


type _ChronoFormat = Literal["%Y/%m/%d %H:%M"] | Annotated[LiteralString, is_chrono_str]
"""https://docs.rs/chrono/latest/chrono/format/strftime/index.html"""

type DateTimeFormat = Literal["iso", "iso:strict", "decimal"] | _ChronoFormat | None
"""
Anything that is resolvable to a date/time column transform.

Notes
-----
When not provided:

- {``.arrow``, ``.parquet``} preserve temporal data types on write
- ``.json`` defaults to **"iso"**
- ``.csv`` defaults to **"iso:strict"**

Examples
--------
Each example will use the same input datetime:

    from datetime import datetime
    datetime(2020, 3, 1, 6, 30, 0)

**"iso"**, **"iso:strict"**: variants of `ISO 8601`_ used in `pl.Expr.dt.to_string`_:

    "2020-03-01 06:30:00.000000"
    "2020-03-01T06:30:00.000000"

**"decimal"**: represents **time only** with fractional minutes::

    6.5  # stored as a float

A format string using `chrono`_ specifiers:

    "%Y/%m/%d %H:%M" -> "2020/03/01 06:30"
    "%s"             -> "1583044200"  # UNIX timestamp
    "%c"             -> "Sun Mar  1 06:30:00 2020"
    "%T"             -> "06:30:00"
    "%Y-%B-%d"       -> "2020-March-01"
    "%e-%b-%Y"       -> " 1-Mar-2020"

.. _ISO 8601:
    https://en.wikipedia.org/wiki/ISO_8601
.. _pl.Expr.dt.to_string:
    https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.to_string.html
.. _chrono:
    https://docs.rs/chrono/latest/chrono/format/strftime/index.html
"""
```
Question

@dsmedia would you be happy if I rewrite `n_rows: Rows` more in the style of `dt_format: DateTimeFormat`?

This would look like:

- Some more content in the docstring
- Replacing `Literal[...]` -> `Annotated[int, is_rows]` (rough sketch below)
- Adding a runtime check for `Spec.n_rows` like `Spec.dt_format`
  - With the `TypeError` message explaining the constraint
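For concreteness, here is a rough sketch of what that could look like, following the `is_chrono_str` pattern above and reusing the multiple-of-1,000/1,000,000 rule; the `is_rows` name and the exact error message are tentative:

```python
from __future__ import annotations

from typing import Annotated, Any

from typing_extensions import TypeIs


def is_rows(n: Any) -> TypeIs[Rows]:
    """Exact thousands below 1m, exact millions from 1m upwards."""
    if not isinstance(n, int) or n < 1_000:
        return False
    step = 1_000_000 if n >= 1_000_000 else 1_000
    return n % step == 0


type Rows = Annotated[int, is_rows]
"""Number of rows to include in the output (validated at runtime by ``is_rows``)."""


# Then, inside ``Spec._validate``, something like:
#     if not is_rows(self.n_rows):
#         msg = (
#             f"{self.n_rows!r} is not a valid row count.\n"
#             "Expected a multiple of 1_000 below 1_000_000, "
#             "or a multiple of 1_000_000."
#         )
#         raise TypeError(msg)
```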
Updated

@dsmedia hopefully I've addressed your concerns in (8ec3adf).

Since the validation logic started getting a bit unwieldy in `Spec.__init__`, I've moved it to `Spec._validate`. There are also checks for valid columns and file extension in there.
- deletes `flights.py` - renames `flights2.py` -> `flights.py`
Addresses:

- `flights-3m.parquet` #642 (review)
- `flights-3m.parquet` #642 (comment)

Description
This PR follows up on the work in (#626, #642) to provide a more self-contained and reproducible version of https://github.com/vega/vega-datasets/blob/5eaa256486deec59f3c4441d0abb568688ff5a81/scripts/flights.py.
Noteworthy changes

Specifications for each dataset are defined in `_data/flights.toml`:

flights.toml (vega-datasets/_data/flights.toml, Lines 1 to 57 in 951fe8c)
Manually visiting transtats and downloading monthly data is no longer required.

Prior method (vega-datasets/scripts/flights.py, Lines 18 to 24 in 5eaa256)
Any data that is missing becomes an asynchronous request, and is output as `.parquet` for efficient and fast storage. All you need to do is run the script.
New method (vega-datasets/scripts/flights2.py, Lines 783 to 797 and Lines 775 to 781 in 951fe8c)
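Not the PR's actual download path, but a minimal sketch of the idea, assuming `niquests.AsyncSession` and a hypothetical `urls` list of monthly source files (the real script goes through `_write_zip_to_parquet` and handles zipped payloads):

```python
import asyncio
from pathlib import Path

import niquests
import polars as pl


async def download_missing(urls: list[str], out_dir: Path) -> None:
    """Fetch any missing monthly sources concurrently, storing each as parquet."""
    async with niquests.AsyncSession() as session:
        responses = await asyncio.gather(*(session.get(url) for url in urls))
    for url, response in zip(urls, responses):
        response.raise_for_status()
        stem = url.rsplit("/", 1)[-1].split(".")[0]
        # Hypothetical: treats each payload as CSV; the real sources are zipped.
        pl.read_csv(response.content).write_parquet(out_dir / f"{stem}.parquet")
```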
Switching from `pandas` -> `polars`. This is for both readability and performance.
The following two functions are roughly equivalent:

- `flights.process_flights_data` (vega-datasets/scripts/flights.py, Lines 258 to 374 in 5eaa256)
- `flights2.SourceMap.clean` (vega-datasets/scripts/flights2.py, Lines 567 to 642 in 951fe8c)

Ignoring the docstring, `SourceMap.clean` is much more concise and IMO a lot easier to understand.
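As a flavor of why: the `polars` style expresses cleaning as one lazy, composable query instead of row-wise mutation. An illustrative fragment only, not the actual `SourceMap.clean` body (column names follow the BTS source schema):

```python
import polars as pl

# Illustrative only: expression-based cleaning in the spirit of
# ``SourceMap.clean``; not its actual body.
lf = (
    pl.scan_parquet("scripts/input/*.parquet")
    .drop_nulls(["DepDelay", "Distance"])
    .select(
        pl.col("FlightDate").alias("date"),
        pl.col("DepDelay").cast(pl.Int16).alias("delay"),
        pl.col("Distance").cast(pl.Int32).alias("distance"),
    )
)
df = lf.collect()
```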
Tasks

- Delete `flights.py`, rename `flights2.py` -> `flights.py`
- `flights.js` (05707d9)
- `flights-1k.csv`